Exploratory risk prediction of type II diabetes with isolation forests and novel biomarkers

Type II diabetes mellitus (T2DM) is a rising global health burden due to its rapidly increasing prevalence worldwide, and can result in serious complications. Therefore, it is of utmost importance to identify individuals at risk as early as possible to avoid long-term T2DM complications. In this study, we developed an interpretable machine learning model leveraging baseline levels of biomarkers of oxidative stress (OS), inflammation, and mitochondrial dysfunction (MD) for identifying individuals at risk of developing T2DM. In particular, Isolation Forest (iForest) was applied as an anomaly detection algorithm to address class imbalance. iForest was trained on the control group data to detect cases of high risk for T2DM development as outliers. Two iForest models were trained and evaluated through ten-fold cross-validation, the first on traditional biomarkers (BMI, blood glucose levels (BGL) and triglycerides) alone and the second including the additional aforementioned biomarkers. The second model outperformed the first across all evaluation metrics, particularly for F1 score and recall, which were increased from 0.61 ± 0.05 to 0.81 ± 0.05 and 0.57 ± 0.06 to 0.81 ± 0.08, respectively. The feature importance scores identified a novel combination of biomarkers, including interleukin-10 (IL-10), 8-isoprostane, humanin (HN), and oxidized glutathione (GSSG), which were revealed to be more influential than the traditional biomarkers in the outcome prediction. These results reveal a promising method for simultaneously predicting and understanding the risk of T2DM development and suggest possible pharmacological intervention to address inflammation and OS early in disease progression.

Another significant challenge in T2DM risk prediction is the black-box nature of many of the proposed models. Black-box models including Random Forest (RF), artificial neural networks (ANN), and support vector machines (SVM) are the most frequently used models for T2DM risk prediction [16][17][18]. However, they inherently lack interpretability and are rarely explained adequately to assist medical professionals in decision-making 19. Various studies investigating the prediction of diabetes have incorporated explainability modules, including Shapley additive explanations (SHAP) 29,40,41 and local interpretable model-agnostic explanations (LIME) 29,41,42. These studies incorporated only basic clinical and demographic variables such as age, BMI and glucose levels, which are useful for early prediction but do not provide insights into potential targets for the prevention of T2DM. Limited clinical and demographic variables can identify individuals at high risk of developing T2DM but fail to reveal the underlying mechanisms or causal factors contributing to disease onset. Consequently, there is a need to integrate a wider range of biomarkers, which could offer deeper insights into the etiology of T2DM and support the formulation of effective prevention strategies.
The main aim of this study was to perform exploratory risk prediction of T2DM with biomarkers of inflammation, OS and MD using one-class classification (OCC) in the presence of scarce data. Given that these biomarkers are not routinely assessed as part of standard clinical practice, data availability is a major challenge, particularly the availability of positive samples (patients developing T2DM in this case). Hence, the use of OCC is particularly valuable in this context, as it allows for effective modelling and prediction even when positive instances are rare, thereby providing a robust framework for early identification of at-risk individuals despite limited data. Thus, the present work provides a two-fold novelty. First, to our knowledge, the aforementioned biomarkers have not been incorporated in the context of ML, and second, OCC, including iForest, has not been utilised for the task of early T2DM risk estimation.

Dataset, participants, and sample collection
The subjects in this study were attendees of a rural diabetes screening clinic at Charles Sturt University (Diab-Health), Albury, Australia, between 2002 and 2015. A total of 2716 entries were obtained from 850 patients, with information on more than 180 attributes. Subjects were included if they initially presented without T2DM and data from a subsequent visit 2-4 years later were available for longitudinal analysis, which identified progression to T2DM in a subsection of the cohort. Participants were classified as having developed T2DM if, at the visit 2-4 years after the initial screening, they reported a diagnosis of T2DM, were on glucose-lowering medication or had a fasting BGL ≥ 7 mmol/L. Inclusion and exclusion of participants is clarified in Fig. 1.
HbA1c was not considered a diagnostic criterion in our study due to missing values and differing methods used to obtain HbA1c values. Some of the data were obtained through point-of-care testing (POCT), and different laboratories were also used for some of the entries, which may pose a problem due to a lack of HbA1c standardization 43,44. Additionally, HbA1c is sensitive to changes in red blood cell cycles, including those caused by vitamin B-12 deficiency, which can produce falsely elevated levels 45.
Based on these criteria, a total of 324 participants were included, with 17 developing T2DM and 307 remaining as controls. The incidence in the current cohort is higher than that reported for Australia, being approximately 5% versus 0.3% 46. Body mass index (BMI), blood pressure measurements, and lipid profiles including high-density lipoprotein (HDL), low-density lipoprotein (LDL), triglycerides, and total cholesterol (TC) were measured as detailed in 47. Blood and urine samples were collected and prepared for measurement of biomarkers of OS, inflammation, MD, and hemostasis according to the methodology detailed in 8 and 10. BGL was determined from finger prick POCT. The study was approved by the Charles Sturt University Human Research
• Traditional features: fasting BGL, BMI, and triglycerides. The value of these three predictors has been shown previously [50][51][52], and they were therefore considered for this study.

Isolation Forest (iForest) algorithm
iForest is an unsupervised, binary tree anomaly detection algorithm developed by Liu et al. 53. Conceptually, anomalies are those data points that require shorter depths, or path lengths, to be isolated from the majority of other points during successive splitting of the dataset using an ensemble of isolation trees, as can be seen in Fig. 2.
Given n points, the anomaly score s for point x can be calculated using the following equation:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
where h(x) is the path length for a single isolation tree, E(h(x)) is the average h(x) for datapoint x across the ensemble of isolation trees, and c(n) is the average path length for a dataset of n points, which is used for normalization purposes.
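As an illustration of how the score behaves, the following minimal sketch computes s for a point from its per-tree path lengths. The closed form used for c(n), namely 2H(n-1) - 2(n-1)/n with the harmonic number H approximated by ln(i) plus the Euler-Mascheroni constant, is taken from the original iForest formulation rather than from the present text, and the path lengths and n = 256 are hypothetical values chosen only for demonstration.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n: int) -> float:
    """c(n): normalising average path length for a sample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(path_lengths: list[float], n: int) -> float:
    """s(x, n) = 2 ** (-E(h(x)) / c(n)) for a single point x."""
    mean_depth = sum(path_lengths) / len(path_lengths)  # E(h(x))
    return 2.0 ** (-mean_depth / average_path_length(n))


# Hypothetical path lengths of two points across four isolation trees.
outlier_score = anomaly_score([2, 3, 2, 4], n=256)    # isolated near the root
inlier_score = anomaly_score([10, 12, 11, 9], n=256)  # isolated deep in the trees
```

Points isolated after few splits receive scores approaching 1, whereas points whose average path length is close to c(n) score around 0.5. Note that library implementations may report a transformed version of this score (for example with the sign flipped), so thresholds should be checked against the documentation of the implementation in use.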

Modelling and evaluation
Baseline models were trained using only traditional biomarkers to assess the performance improvement when adding novel biomarkers of OS, inflammation and MD. Given that iForest is a black-box model, Depth-based Isolation Forest Feature Importance (DIFFI) was employed to add explainability to the model and identify the most influential predictors. Experiments were carried out to assess the effectiveness of iForest for the predictive classification task using only traditional features initially, and to assess the change in model performance when adding the additional predictor variables discussed earlier. This created two distinct models for comparison: iForest with traditional features only, and iForest with all predictor features. For comparison, an additional three RF models were also trained using three oversampling techniques: SMOTE, Borderline SMOTE and ADASYN. To mitigate overfitting due to the small number of positive samples, recursive feature elimination was performed to keep only the top 10 features of the aforementioned 17 biomarkers.
The data were split into training and testing sets using a 70:30 ratio. To assess the stability of the results, 10 iterations of this split were employed and evaluated, and the mean and standard deviation (SD) of the model evaluation metrics were computed, followed by computing the coefficients of variation (SD/mean).
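The repeated-split procedure and its summary statistics can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the use of one random seed per iteration, the unstratified shuffle, and the choice of the sample SD are all assumptions, and the per-iteration recall values are placeholders rather than study results.

```python
import random
import statistics


def split_70_30(n_samples: int, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle sample indices and cut them into 70% train / 30% test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(0.7 * n_samples)
    return indices[:cut], indices[cut:]


def summarise(scores: list[float]) -> dict[str, float]:
    """Mean, sample SD and coefficient of variation (SD / mean) of a metric."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # assumption: sample (n - 1) SD
    return {"mean": mean, "sd": sd, "cv": sd / mean}


# Ten independent splits of a 324-participant cohort, one seed per iteration.
splits = [split_70_30(324, seed) for seed in range(10)]

# Placeholder per-iteration recall values (hypothetical, not study results).
recalls = [0.80, 0.75, 0.90, 0.85, 0.70, 0.80, 0.95, 0.75, 0.80, 0.80]
summary = summarise(recalls)
```

The same ratio applies directly to a reported mean ± SD: for example, a recall of 0.81 ± 0.08 corresponds to a coefficient of variation of 0.08 / 0.81, or roughly 9.9%.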
Min-max normalization was applied to all features to scale values within the range [0,1] using the following formula, where X represents each feature:
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
All experiments were carried out in Python 3.11.5. Scikit-learn 1.2.2 was used to implement RF and iForest using default parameters except for the contamination parameter, which is the expected proportion of anomalies in the dataset and was set to 0.05, representing an expected 5% of G-T2DM in the dataset. All models were assessed using recall, precision, F1-score and accuracy, which are defined through the following equations:
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP represents true positives, FP false positives, TN true negatives and FN false negatives. The mean and SD were computed for the metrics over the 10 iterations for all models.
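The scaling step and the four evaluation metrics described above can be sketched in plain Python. The function names, the handling of zero denominators, and the example values and confusion-matrix counts are illustrative assumptions and do not come from the study.

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Rescale one feature to [0, 1] via (X - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: map everything to 0 (assumed convention)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Recall, precision, F1-score and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


# Hypothetical fasting BGL values (mmol/L) and hypothetical test-set counts.
scaled = min_max_scale([4.0, 5.5, 7.0])
metrics = classification_metrics(tp=4, fp=1, tn=90, fn=1)
```

In a train/test setting, the minimum and maximum would typically be estimated on the training split only and then reused for the test split, so that no information from the test data influences the scaling.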

Model explainability
To provide much-needed explainability to the black-box iForest model, Depth-based Isolation Forest Feature Importance (DIFFI) was added to the analysis pipeline. Described in 54, DIFFI assigns higher feature importance scores to features that induce higher imbalance in favor of anomalous points, and to those that isolate anomalies at shallower depths. Global Feature Importance (GFI) scores are computed by updating cumulative feature importance scores according to the depth of the splitting feature and the induced imbalance at each node, termed the Induced Imbalance Coefficient 54.
To compute GFI across the 10 iForest models (corresponding to the 10 dataset splits), the scores are aggregated for all models. Features (p) are then ranked in decreasing order according to their cumulative DIFFI score. An additional quantity is then added according to feature rank r to further differentiate the most important from the least important features. The details of the implementation of global DIFFI computation and feature ranking are provided in the original work by Carletti et al. 54.

Demographic and clinical characteristics
Our data consist of 324 participants divided into two groups according to glycemic outcome, with Tables 1 and 2 presenting categorical and numerical baseline characteristics, respectively. The first group remained as controls (G-Controls), while the second group progressed to T2DM (G-T2DM). Regarding participants in G-Controls, 23.5% (72/307) were in the prediabetic stage (5.5 < BGL < 7 mmol/L) at baseline, while 47.1% (8/17) were prediabetic in G-T2DM. No significant differences were found between G-Controls and G-T2DM in terms of age, gender, hypertension status, smoking, alcohol consumption, cardiovascular disease, and statin use. However, BMI was significantly higher in G-T2DM (p < 0.001).

Blood and urinary biomarkers
Table 2 also summarizes inferential statistics on the blood and urinary biomarkers of participants. As expected, the group of participants in G-T2DM had significantly higher baseline levels of HbA1c and BGL. In terms of lipid profile, the two groups were matched except for triglycerides, which were significantly higher in G-T2DM (p = 0.01). Inflammatory biomarkers indicated elevated levels of inflammation in G-T2DM, as revealed by significantly higher levels of IL-6 and IL-10. However, the remaining biomarkers of inflammation (MCP-1, CRP, IL-1β, IGF-1) did not reveal any significant differences. No significant differences were found for the mitochondrial biomarkers (HN, MOTS-c, and P66Shc), OS biomarkers (8-isoprostane, 8-OHdG, GSSG) and biomarkers of hemostasis (C5a and D-dimer). However, inflammatory, OS and MD features played a significant role in predicting the risk of T2DM, as discussed below.

iForest performance evaluation and comparison with oversampling techniques
Figure 3 summarizes the mean ± SD of the evaluation metrics obtained across the tenfold cross validation. Augmenting traditional biomarkers of BGL, BMI and triglycerides with biomarkers of inflammation, OS, and MD improved model performance across all metrics. The biggest performance boost was seen in the model recall, which increased from 0.57 ± 0.06 to 0.81 ± 0.08. Accuracy increased from 0.84 ± 0.02 to 0.91 ± 0.03, F1-score from 0.61 ± 0.05 to 0.81 ± 0.05, and precision from 0.67 ± 0.09 to 0.82 ± 0.11. Accordingly, the coefficients of variation for the model with all features were 3.3% for accuracy, 9.9% for recall, 13.4% for precision and 6.2% for F1-score. In comparison, the RF models with the oversampling techniques all performed poorly, particularly in terms of precision, as displayed in Table 3.

Discussion
The objective of this study was to explore the performance of biomarkers of OS, inflammation, and MD for the prediction of T2DM occurrence with a highly imbalanced dataset utilizing iForest, an anomaly detection algorithm. By leveraging DIFFI scores, explainability was achieved for the black-box model, and important features for the prediction were identified that can be of clinical value for selecting appropriate treatment. According to glycemic outcome, participants were divided into two groups: those remaining as controls (G-Controls) and those progressing to T2DM (G-T2DM). Nearly one-quarter of participants in the control group were prediabetic at baseline, and more than half of those developing T2DM (9/17) were not prediabetic. This highlights the need for measures beyond BGL to better monitor and predict the development of this disease.
The dataset used in this study suffered from a class imbalance (ratio < 1:15), with those presenting with T2DM as the minority class at only 5.5% of participants. This imbalance was addressed by utilizing anomaly detection rather than traditional binary ML techniques. iForest models were trained with two sets of features for predicting the risk of T2DM development. One set consisted of only traditional biomarkers (BGL, BMI, triglycerides), and the second included both traditional and new biomarkers (OS, inflammation and MD). Additionally, the iForest method was compared with various oversampling techniques to assess the utility of OCC in the presence of a small sample of positive cases for model training.
Baseline BMI, IL-6, and IL-10 were significantly higher for participants in G-T2DM. IL-6 is an inflammatory cytokine that was previously found to be increased in individuals with T2DM 55, and increased levels of IL-6 in adipose tissue have been linked to insulin resistance 56. In obese individuals, the release of non-esterified fatty acids from adipose tissue is believed to contribute to insulin resistance and β-cell dysfunction, with the term diabesity coined to illustrate the tight association between obesity and T2DM 57,58. However, IL-6 can also have an anti-inflammatory effect and improve glucose metabolism 59,60. Hence, models that are based on single features may not capture such complex feature interactions. The activity of IL-6 may also be concentration dependent, activating different signaling pathways, including reactive oxygen species reduction 61.
iForest models outperformed RF with oversampling across all metrics except for accuracy; however, this was due to the latter models' bias towards predicting the negative class. These results indicate the advantage of employing the OCC technique in the case of data scarcity, particularly when the features of interest are not routinely collected or are expensive to obtain 33.
The inclusion of biomarkers of OS, inflammation and MD improved the performance across all metrics in comparison to predictive modelling with only traditional biomarkers of BGL, BMI and triglycerides. The greatest boost in performance was observed for recall and F1-score. This is particularly important given the higher cost of missing future cases of T2DM as opposed to predicting false positives, considering that interventions mainly consist of lifestyle changes. Furthermore, the coefficients of variation for the evaluation metrics indicated low to moderate variability, with values below 10% for accuracy, F1-score, and recall indicating good stability for the trained model.
The top five predictors in terms of DIFFI scores were IL-10, 8-isoprostane, GSSG, HN and P66Shc, while the lowest scores were obtained by BGL and triglycerides, further highlighting the potential role of these novel biomarkers for ML prediction of T2DM development.
The anti-inflammatory IL-10 is generally hypothesized to play a protective role in T2DM 62, and IL-10 gene polymorphisms have been suggested for T2DM screening 63,64. IL-10 is believed to improve glycemic control through its immunomodulatory effects by inhibiting cytokine production 65. This is in line with the results of our study, where significantly lower levels of IL-10 were observed in G-T2DM.
8-isoprostane, a biomarker of lipid peroxidation, has shown varying efficiency as a biomarker for prediabetes 10,66. However, the present study agrees with the results reported by Schöttker et al. 67, in which higher levels of 8-isoprostane were associated with the incidence of T2DM in participants 65 years of age or older. Given that the median age for patients developing T2DM in our study is 65, the reduced tolerance for OS with age would also be apparent.
GSSG is the oxidized form of GSH, an antioxidant defense system primarily stored in and released from erythrocytes 68. GSH is converted to GSSG in the presence of free radicals, and in individuals with T2DM, regeneration of GSH from GSSG is impaired because of insufficient factors necessary for this conversion. Furthermore, the increase in free radical production that accompanies rising BGL during T2DM progression in turn activates the GSH scavenger, producing higher levels of GSSG 56. Hence, the combined action of meta-inflammation and GSH antioxidant activity indicates the interactive role of diverse biomarkers in possibly mitigating disease progression, which could lead to a novel treatment pathway for T2DM in conjunction with traditional clinical options.
HN is a mitochondrial-derived peptide (MDP) that plays a key role in metabolism and insulin sensitivity 69,70. Voight and Jelinek 8 found decreased levels of HN in prediabetic patients, consistent with the important role of HN in glucose metabolism through its antiapoptotic and antioxidant functions 71. Conversely, P66Shc, a Shc protein that modulates OS and promotes apoptosis, has been implicated in T2DM development and progression through its association with pancreatic β-cell dysfunction and suppression of insulin signaling 72,73.
Our results indicate important interactions between inflammatory and OS biomarkers associated with T2DM progression over time and highlight the lesser role of traditional features. To gain a better understanding of the specific interactions between these biomarkers, a larger number of participants is required in order to obtain performance metrics and feature importance scores that increase the reliability of our results. Furthermore, a larger dataset would allow for appropriate hyperparameter tuning to be carried out to optimize the results further. Additionally, the possible change in data distribution introduced by missing data imputation may have impacted subsequent ML pipelines and feature importance. Finally, due to data scarcity, the selected cohort included all participants without T2DM, which should be investigated in future studies with the availability of a larger and more specific cohort to provide targeted insights.

Conclusion
Based on the results of this study, two conclusions can be drawn. First, typical monitoring of T2DM risk through BGL may not provide a comprehensive picture of T2DM disease progression. Second, the influential biomarkers identified were IL-10, 8-isoprostane, GSSG, HN and P66Shc, revealing the potential for biomarkers of inflammation, OS and MD to serve as a guide for targeted, personalized intervention in the prevention of T2DM incidence.

Figure 1. Flowchart for inclusion and exclusion of subjects.

Figure 2. Panels (a) and (b) depict the classification of datapoints using iForest. The yellow point is an inlier, while the red point is an outlier, or an anomaly. (a) iForest uses random splits across dimensions in the data, and as depicted, fewer partitions are required to isolate the outlier (red) when compared with the inlier (yellow). In (b), the outlier is isolated closer to the root node, requiring a shorter depth or path length.

Feature importance with depth-based isolation forest feature importance (DIFFI)
Global feature importance scores obtained through DIFFI are summarized in Fig. 4. The five most important features were IL-10, 8-isoprostane, GSSG, HN and P66Shc. Traditional biomarkers of BGL and triglycerides were the least important features overall, whereas BMI was only in 10th place out of 17 features.

Figure 3. Mean values and standard deviation bars of performance metrics, Accuracy (Acc), F1-score, precision and recall for the two iForest models evaluated across ten folds, one with conventional biomarkers only, and the second using conventional and novel biomarkers of inflammation, OS and MD.

Table 1. Descriptive statistics of the study participants at baseline. Values are described as numbers (%). Significant (p < 0.05) differences were detected using χ² tests.

Table 2. Descriptive statistics of the study participants at baseline. Values are described as median (Q1-Q3). Significant (p < 0.05) differences were detected using Wilcoxon rank sum tests.